Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

Author

  • Daniel Ting
Abstract

We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items, or heavy hitters, in a data set. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario in which the data is disaggregated. In this case, a per-unit metric of interest can only be computed as an expensive pre-aggregation of the raw, disaggregated data. For example, the metric of interest may be total clicks per user while the raw data is a click stream containing multiple rows per user. Thus the sketch is suitable for a wide range of applications, including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show that the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but also gives strongly consistent estimates for the proportion of each frequent item. For subset sum estimation, it asymptotically draws a probability-proportional-to-size sample that is optimal for estimating the sum over the data. For non-i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state-of-the-art method on pre-aggregated data. When compared to uniform sampling, it performs orders of magnitude better on skewed data. We also propose extensions to the sketch that allow it to be used in combining multiple data sets, in distributed systems, and for time-decayed aggregation.
arXiv:1709.04048v1 [stat.CO] 12 Sep 2017
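
The abstract does not spell out the sketch's update rule, so the following is only an illustrative sketch under stated assumptions, not the author's implementation. It shows one way a fixed-size summary can work directly on disaggregated rows: a Space-Saving-style counter table in which, when the table is full, the smallest counter is incremented and its label is replaced by the new item with probability 1/(new count). That randomized replacement keeps each item's expected count equal to its true count. The class name `RandomizedSpaceSaving`, its methods, the synthetic click stream, and the filter are all hypothetical.

```python
import random
from collections import Counter


class RandomizedSpaceSaving:
    """Fixed-size counter table with randomized label replacement (illustrative assumption)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.counts = {}  # item label -> estimated count
        self.rng = random.Random(seed)

    def update(self, item):
        """Process one raw (disaggregated) row for `item`."""
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Linear scan for the smallest counter; a heap or bucket list
            # would be faster, but this keeps the example short.
            victim = min(self.counts, key=self.counts.get)
            new_count = self.counts.pop(victim) + 1
            # The new item takes over the bin with probability 1 / new_count,
            # otherwise the old label keeps the incremented counter.
            if self.rng.random() < 1.0 / new_count:
                self.counts[item] = new_count
            else:
                self.counts[victim] = new_count

    def subset_sum(self, predicate):
        """Estimate the total count over items that satisfy `predicate`."""
        return sum(c for item, c in self.counts.items() if predicate(item))

    def top(self, n):
        """Return the n labels with the largest estimated counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]


# Synthetic skewed click stream: one row per click, many rows per user.
rng = random.Random(0)
users = [f"user{i}" for i in range(200)]
weights = [1.0 / (i + 1) for i in range(200)]
stream = rng.choices(users, weights=weights, k=20000)

sketch = RandomizedSpaceSaving(capacity=50, seed=1)
for user in stream:
    sketch.update(user)


def is_even_user(user):
    # Arbitrary filter condition chosen only for the example.
    return int(user[4:]) % 2 == 0


exact = Counter(stream)  # the expensive pre-aggregation the sketch avoids
estimate = sketch.subset_sum(is_even_user)
truth = sum(c for u, c in exact.items() if is_even_user(u))
print("estimated clicks:", estimate, "exact clicks:", truth)
print("largest counters:", sketch.top(5))
```

In this variant every row adds exactly one to exactly one counter, so the counters always sum to the number of rows processed, and a filter that accepts everything recovers the stream length exactly. How close a narrower filter gets to the exact answer depends on the capacity and on how skewed the data is.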

Related resources

Data Sketches for Disaggregated Subset Sum Estimation

We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the former, the sketch provides unbiased estimates with state of the art accuracy. It is specifically designed to handle the challenging scenario when the data is di...


Time-decaying Sketches for Robust Aggregation of Sensor Data

We present a new sketch for summarizing network data. The sketch has the following properties which make it useful in communication-efficient aggregation in distributed streaming scenarios, such as sensor networks: the sketch is duplicate insensitive, i.e., reinsertions of the same data will not affect the sketch and hence the estimates of aggregates. Unlike previous duplicate-insensitive sketc...


Improvement of effort estimation accuracy in software projects using a feature selection approach

In recent years, feature selection techniques have become essential for data processing and model construction in many scientific areas. In the field of software project effort estimation, applying dimensionality reduction and feature selection methods has become unavoidable. The high volumes of data, costs, and time necessary for gathering data, ...


On the Variance of Subset Sum Estimation

For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries to arbitrary subset sums. With unit weights, we can compute subset sizes which together with the previous sums provide the subset averages. The question addre...


Pyramid Sketch: a Sketch Framework for Frequency Estimation of Data Streams

A sketch is a probabilistic data structure used to store and query the frequency of any item in a given multiset. Due to its high memory efficiency, it has been applied in various fields of computer science, such as stream databases and network traffic measurement. The key metrics of sketches for data streams are accuracy, speed, and memory usage. Various sketches have been proposed, but...


Journal title:
  • CoRR

Volume abs/1709.04048

Publication date 2017